Abstract



Large-scale proteomic analysis is emerging as a powerful 
technique in biology and relies heavily on data acquired 
by state-of-the-art mass spectrometers. As with any other 
field in Systems Biology, computational tools are required 
to deal with this ocean of data. iTRAQ (isobaric Tags for 
Relative and Absolute quantification) is a technique that al- 
lows simultaneous quantification of proteins from multiple 
samples. Although iTRAQ data gives useful insights to the 
biologist, it is more complex to perform analysis and draw 
biological conclusions because of its multi-plexed design. 
One such problem is to find proteins that behave in a sim- 
ilar way (i.e. change in abundance) among various time 
points since the temporal variations in the proteomics data 
reveal important biological information. Distance-based
methods such as Euclidean distance or the Pearson coefficient,
and clustering techniques such as k-means, are not able
to take into account the temporal information of the series.
In this paper, we present a linear-time algorithm for
clustering similar patterns among various iTRAQ time course
data irrespective of their absolute values. The algorithm,
referred to as Temporal Pattern Mining (TPM), maps the
data from a Cartesian plane to a discrete binary plane.
After the mapping, a dynamic programming technique allows
mining of similar data elements that are temporally closer
to each other. The proposed algorithm clusters
iTRAQ data that are temporally closer to each other with
more than 99% accuracy. Experimental results for different
problem sizes are analyzed in terms of quality of clusters,
execution time and scalability for large data sets. An
example from our proteomics data is provided at the end to
demonstrate the performance of the algorithm and its
ability to cluster temporal series irrespective of their distance
from each other.

1 Introduction 

Mass spectrometry is a fundamental part of any modern
proteomics research platform for accurate protein
identification and quantification [1] [2] [3]. Mass
spectrometers measure the mass-to-charge ratio (m/z) of ionized
particles [4]. In the case of a typical LC-MS/MS proteomic
experiment, the ionized particles (i.e. peptides) are
introduced into a mass spectrometer at the ion source in the form
of liquid solutions, then desolvated and transferred into the
gas phase as gas phase ions. A variety of search algorithms
are then used to match the peptide spectra to sequences in
online databases in order to identify the proteins in the
mixture [5] [6].

iTRAQ (isobaric tags for relative and absolute quan- 
tification) is a technique used to identify and quantify pro- 
teins from different sources in one single experiment. It 
uses isotope coded covalent tags and is used to study
quantitative changes in the proteome [7] [8]. The method is
based on covalent labeling of the N-terminus and lysine 
side chains from protein digestions with tags of various 
masses for distinction. Up to 8 different tagging reagents 
(8-plex kit) are used to label peptides from different sam- 
ples. The samples are then pooled together as a single sam- 
ple and analyzed by mass spectrometer. The fragmentation 
of the attached tag generates a low molecular mass reporter 
ion which is useful in quantifying relative peptide abun- 
dance between the different iTRAQ channels. 

The iTRAQ technique allows analysis of samples in
a more sophisticated and accurate manner, in turn giving
more relevant biological information such as
phosphorylation of peptides or the effect of vasopressin at different
time points in the mass spectrometry data. Although the technique
allows greater accuracy in quantitation, it raises many
computational problems. One such problem is to identify the
peptides that behave similarly for a given external agent,
e.g. dDAVP, over the time course study. Time course
measurements from iTRAQ data are becoming a common
procedure in many systems biology experiments [9] [10] [11].

If the experiment is subject to variations in time, the
conventional methods to cluster and analyze similarity,
such as Euclidean distance and Hamming distance, have
significant limitations. Likewise, clustering mechanisms
that use distance-based measures, such as k-means [12] or
hierarchical clustering [13], do not always succeed when
responses are highly variable in magnitude. Other
methods, such as fuzzy clustering of short time-series [14], are
not computationally efficient beyond a certain number of time
courses due to the combinatorial explosion in possibilities.
A key requirement for successful scalable clustering
remains a linear algorithm that involves a small
number of passes over the database [15].

In this paper we present a near-linear time clustering
algorithm that finds temporal patterns in a given large
iTRAQ labeled dataset using one or a small number of passes
over the data, without compromising the quality of the
clusters. Our algorithm draws its motivation from mapping
problems in the parallel processing community [16] [17]
and quantization in information theory [18] [19]. The
proposed algorithm allows us to map the time points of a
Cartesian plane into a discrete plane. The mapping then
produces a predictable number of clustering possibilities
that are then binned using an efficient dynamic programming
technique to extract the patterns.

The paper is organized as follows. In section 2 we 
provide the biological experimental details and the associ- 
ated computational problem. Section 3 discusses the pro- 
posed clustering algorithm and complexity analysis. Sec- 
tion 4 discusses the experimental results and illustrative ex- 
amples for our clusters. Finally, conclusions are presented 
in section 5 of the paper. 
Figure 1: The flow diagram for the experiment

2 Problem Statement

The objective of the study was to perform a quantitative
comparison of protein phosphorylation under vasopressin
(dDAVP) treatment using iTRAQ labels for different
time points. LC-MS/MS [20] phosphoproteomics analysis
was performed, and the time course clusters obtained
from the study provide the basis for modeling
the signaling network involved. The flow diagram for the
experiment is shown in Fig. 1.

In computational terms, the problem that we wish to
solve is as follows. We are given a set of peptides with time
points ($t_1, t_2, \ldots$), with each time point having a certain
real value, which in our case is the iTRAQ ratio. Given the
peptides with time points and real values, we want to be
able to cluster the peptides that give a similar pattern over
the time course. An example is shown in Figure 2.

Peptide  t1   t2   t3    t4
VYEPLK   0.2  0.3  -0.1  0.5
IHIDPE   0.1  0.2  0.4   0.6
LEVAK    0.5  0.7  0.1   0.3

Figure 2: The peptides with corresponding time points and their
values. The first and third peptides belong to the same cluster.


The first column in Figure 2 shows peptide sequences,
followed by 4 columns of iTRAQ ratios corresponding to
the change in peptide abundance between the vasopressin
and vehicle control samples ("dDAVP/vehicle"), each
corresponding to a different time point. The number
of columns depends on the time courses that
are considered and can increase or decrease depending on
the particular experiment. Our objective is to determine the
data points that have similar temporal patterns. It can be
seen in the figure that peptides VYEPLK and LEVAK
have similar patterns over the time course because both
increase from $t_1$ to $t_2$, decrease from $t_2$ to $t_3$ and
increase from $t_3$ to $t_4$. The peptide IHIDPE, on the other
hand, increases for all the time points. Hence, the peptides
VYEPLK and LEVAK must be clustered together.
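The example in Figure 2 can be checked with a few lines of Python. This is a minimal sketch, not the authors' implementation: the helper names `directions` and `euclidean` are hypothetical, and the Euclidean comparison is added here only to illustrate why a distance measure can disagree with the temporal pattern.

```python
import math

# iTRAQ ratios copied from Figure 2 (four time points per peptide).
ratios = {
    "VYEPLK": [0.2, 0.3, -0.1, 0.5],
    "IHIDPE": [0.1, 0.2, 0.4, 0.6],
    "LEVAK": [0.5, 0.7, 0.1, 0.3],
}

def directions(series):
    """Return True for each step that rises and False for each step that does not."""
    return [nxt > cur for cur, nxt in zip(series, series[1:])]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

same_pattern = directions(ratios["VYEPLK"]) == directions(ratios["LEVAK"])
d_pattern_mate = euclidean(ratios["VYEPLK"], ratios["LEVAK"])
d_other = euclidean(ratios["VYEPLK"], ratios["IHIDPE"])

print(same_pattern)              # True: both rise, fall, then rise
print(d_other < d_pattern_mate)  # True: IHIDPE is nearer in distance
```

On these three peptides, VYEPLK is closer to IHIDPE in Euclidean distance even though it shares its rise/fall pattern with LEVAK.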



Now let us formally define the problem. Let there be
N data points, with each data point having K time course
values, and let X represent the peptide name. Then let U
represent the set such that $U_i = \{X_i, (t_1, \ldots, t_K)\}$, where
$1 \le i \le N$. Also, let the number of clusters be Q. Then the
cluster set is $C = \{c_1, c_2, \ldots, c_Q\}$, where each $c_j$ is a subset
of the $U_i$ with $1 \le i \le N$ and $1 \le j \le Q$. Each cluster
$c_j$ then holds the set of data points that have the same temporal
pattern over time.

The temporal similarity is defined as follows. Let $c_{pq}$
represent the $q$-th data point in cluster $p$, where $1 \le p \le Q$ and
$1 \le q \le N$. Now let $c_{pq}(x)$ represent the mapped data
point at point $x$. Let there be another mapped data point
$c_{pq}(y)$ at point $y$, with $x < y$. Now define an array $r$
with entries $r_h$, where $1 \le h \le K$. Each point in the array is

\[ r_h = c_{pq}(y) - c_{pq}(x) \tag{1} \]

Then the data points that have strictly equal arrays $r$ are
considered temporally similar.

3 Proposed Mining Algorithm 

In this section, we present details of the proposed mining 
strategy, referred to as Temporal Pattern Mining (TPM). 
We also analyze the computational complexities of the pro- 
posed algorithm. 

The proposed algorithm TPM draws its motivation
from the mapping problem in parallel and distributed
computing [16] [17] [21] and information theory [18] [19]. The
mapping problem in parallel computing and the mapping for
our mining algorithm share a similar characteristic. In
mapping for parallel processing, two sets of nodes are
considered: problem modules and processor modules. The
objective is to map the problem modules onto processor
modules in an efficient manner. In mapping for the mining
algorithm, we seek a mapping such that the Cartesian
plane of the data points is mapped onto a discrete
plane (e.g. a binary plane) of finite possibilities. The discrete
plane is defined using the Nyquist sampling technique,
which allows conservation of the information from a
continuous signal (or a data set with real numbers). These finite
possibilities, which can grow exponentially with increasing
time courses, are then mined using our efficient dynamic
programming technique. Algorithm 1 gives an intuitive
description of the strategy.

Data: A set U_i = {X_i, (t_1, ..., t_K)} of peptides and
      their time courses
Result: Compute the cluster set C = {c_1, c_2, ..., c_Q}
        such that the clusters have temporal similarity
        within the distinct clusters
for i = 1 to N do
    Compute A[i] := mapping(U_i) from Cartesian to discrete plane
end
while there are values in A that are not null do
    pick a random non-null value from A, call it A_R
    count++
    for j = 1 to N do
        distance = EditDistance(A_R, A[j])
        if distance == 0 then
            assign U_j to cluster c_count
            /* This is to eliminate values from A that have
               already been assigned to a cluster */
            A[j] <- NULL
        end
    end
end

Algorithm 1: Mapping based temporal pattern mining algorithm

3.1 Mapping from Cartesian to discrete plane

The proposed algorithm TPM can be classified as a feature
extraction algorithm [22]. As defined in the problem
statement in the section above, we have a number of peptides
with associated iTRAQ ratio values and we are interested
in extracting clusters of the peptides that give similar
expression levels (falling or rising) at different time points.
However, k-means or hierarchical clustering cannot be used
because the time points may be closer to each other in
Euclidean distance, but may not be close in temporal changes
over the time course.

Data: A data point in the Cartesian plane
Result: Return the discrete plane representation of the
        data point
mapping(datapoint U)
Vector V
for w = 0 to U.length() - 1 do
    if w = 0 then
        /* save as name of the peptide */
    else
        if U_{w+1} - U_w > 0 then
            V_w = a
        else
            V_w = b
        end
    end
end
return V

Algorithm 2: Mapping function

Data: Two strings of length m and n
Result: Levenshtein distance between the two strings is
        returned
/* d is a table with m+1 rows and n+1 columns */
EditDistance(char s[1..m], char t[1..n])
int d[0..m, 0..n] <- 0
for i from 0 to m do
    d[i, 0] := i    /* deletion */
end
for j from 0 to n do
    d[0, j] := j    /* insertion */
end
for j from 1 to n do
    for i from 1 to m do
        if s[i] = t[j] then
            d[i, j] := d[i-1, j-1]
        else
            d[i, j] := minimum{ d[i-1, j] + 1, d[i, j-1] + 1,
                                d[i-1, j-1] + 1 }
        end
    end
end
return d[m, n]

Algorithm 3: Dynamic programming cluster extraction
subroutine

Clustering using the real values from the Cartesian
plane, however, is not feasible because of their continuous
nature (infinitely many values). With Cartesian
co-ordinates the number of possible cluster combinations is
infinite, and clustering over all possible
combinations is not computationally feasible. Thus, the
Cartesian plane co-ordinates have to be mapped to
discrete plane co-ordinates to restrict the number of
combinations. The mapping function should
allow us to quantify the variations in the data with respect to
time and also make the values discrete enough that the
number of possible combinations decreases
drastically.

To address these challenges, the mapping function
that is presented allows us to make the values more discrete
and also conserves the important information of expression
levels between the time periods. Using the same notation
that we presented in the previous section, let U represent
the set of values such that $U_i = \{X_i, (t_1, \ldots, t_K)\}$, where
$1 \le i \le N$. $X_i$ represents the peptides and $t_1, \ldots, t_K$
represent the values of the expression level from 1 to K. The
mapping from the Cartesian plane is accomplished
as follows:

\[
M(x, y) =
\begin{cases}
a & \text{if } (t_x < t_y) \text{ and } (x < y) \text{ and } (y - x = 1); \\
b & \text{otherwise.}
\end{cases}
\tag{2}
\]

The assumption in the mapping function is that the
first value is zero. This assumption is
appropriate for our data but is not a generalized rule and
can be changed accordingly. The mapping function is
simple in its functionality. For each time point it looks at
the value at the next time point. If the next value
is above the current value, the step is assigned 'a';
otherwise it is assigned 'b'. Therefore, after the mapping
has been completed, each of the time series is a
sequence of a's and b's, with each of the characters
representing a rise or fall in the expression level. The number
of discrete levels can change according to the biological
system under consideration.
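The mapping can be sketched in Python as follows. This is an illustrative version under stated assumptions rather than the authors' Java code: the function name `mapping` follows Algorithm 2, while the optional `baseline` argument is a hypothetical parameter that prepends the assumed initial value (zero in the text) before the first time point.

```python
def mapping(series, baseline=None):
    """Map a series of real values to a string of 'a' (rise) and 'b' (otherwise).

    baseline is a hypothetical parameter: when given, it is prepended as the
    value before the first time point (the text assumes this value is zero).
    """
    values = list(series) if baseline is None else [baseline, *series]
    # Compare each value with the next one, as in Equation (2): strict rise -> 'a'.
    return "".join("a" if nxt > cur else "b" for cur, nxt in zip(values, values[1:]))

print(mapping([0.2, 0.3, -0.1, 0.5]))       # aba
print(mapping([0.5, 0.7, 0.1, 0.3]))        # aba
print(mapping([0.1, 0.2, 0.4, 0.6]))        # aaa
print(mapping([0.2, 0.3, -0.1, 0.5], 0.0))  # aaba
```

The inputs are the three peptides of Figure 2. A tie between consecutive values falls into the 'b' branch, which matches the "otherwise" case of Equation (2).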

The mapping function has been defined in a way that
makes a clear distinction between the data points that have
similar temporal patterns and the ones that do not. Figure 3
shows that data points A (red) and B (black) are very close
to each other in distance. However, in our mining scenario
we are more interested in the pattern of the expression
levels over time and, under this criterion, data
point C (blue) is closer to B (black). If a naive k-means
clustering were executed, data points A and B would
be clustered together. Using our mapping function,
however, we are able to make a clear distinction irrespective
of the Euclidean distance of the data points with respect to
each other. Observe that the mapping of co-ordinates
produced by our algorithm is the same for the data points that
have similar temporal behaviors (i.e. B and C). Therefore,
once the mapping has been accomplished, the data points
that have the same mapping belong to a single cluster. The
pseudocode for the mapping function is given in Algorithm 2.

Figure 3: The mapping of a Cartesian plane to a discrete mapped
plane

3.2 Efficient Extraction of Clusters

Once the mapping is complete, the next step is to mine
the clusters that are present in the data set. An efficient
way is needed to extract the clusters from the mapped data
because the number of clusters can increase exponentially
with an increase in the time points as well as in the number of
discrete levels desired. Consider, for example, two discrete
levels, as in our case. The number of
clusters would be upper bounded by the order of $Z^K$ for Z
states (or $2^K$ for two states), where K is the number of time
points, but not all of them may be present
in the data [23]. Even for a moderately large K the
number of combinations is huge, and for each data point going
through all of the possible combinations is a waste of
precious computing resources. For TPM we developed an
efficient technique that allows us to keep the search space
confined to the clusters that are present in the dataset.
Therefore, using our technique the exponential search space of
possible clusters is reduced to the set of clusters that
are present in the data set, saving valuable computing
resources.

Figure 4: Extraction of clusters after mapping is complete

The procedure to mine the relevant clusters is shown in
Algorithm 3. The crux of the technique is a dynamic
programming edit distance algorithm that allows us
to calculate the 'distance' of a particular data point from
another. We randomly pick a data point from the mapped
data values. The randomly picked data point is then used
to calculate the Levenshtein distance to the other data points.
The data values that have zero distance to one another
belong to the same cluster, because they have the
same pattern. The technique is very efficient in practice
because, for a large number of time points, the number of
clusters that are actually present in the data is far less than the
possible number of clusters. Thus, the computational
complexity is greatly decreased, which makes the system more
efficient.
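For reference, the edit distance subroutine of Algorithm 3 can be written in Python as below. This is a minimal sketch of the standard Levenshtein recurrence; the function name `edit_distance` is chosen here for illustration and is not taken from the authors' code.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (dynamic programming)."""
    m, n = len(s), len(t)
    # d[i][j] holds the distance between the first i chars of s and first j of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions turn a prefix of s into the empty string
    for j in range(n + 1):
        d[0][j] = j  # j insertions turn the empty string into a prefix of t
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[m][n]

print(edit_distance("aba", "aba"))    # 0
print(edit_distance("aba", "aaa"))    # 1
print(edit_distance("abab", "baba"))  # 2
```

For two mapped strings of equal length, a distance of zero is obtained exactly when the strings are identical, which is the condition used to place two data points in the same cluster.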



Figure 4 shows that the clusters are extracted from the set
in only a small, finite number of operations, compared to the
large number of operations that would be required if all the possible
combinations were considered. The figure shows 27
time points for each data point that has been mapped using
our mapping strategy. The total number of possible
combinations is $2^{27}$ for each of the data points, making the total
number of operations equal to $23 \times 2^{27}$, i.e. $O(N 2^K)$ for
$N$ data points and $K$ time points. However, with our
strategy, we pick one of the data points randomly.
In the figure, data points (time courses) in black font are
picked and the edit distance is calculated for each of the
data points. These data points get a zero distance
and are accumulated in a single cluster $c_1$. After
adding them to the cluster set, these black data points are
excluded from the search and the process is repeated until
only one kind of data point is present. The total number
of operations is equal to $23 \times 4$, i.e. $O(QN)$, where Q is
the number of clusters and N is the number of elements
present. It is apparent in Fig. 4 that only 3 passes are
needed to extract all the clusters from the data set.
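The pass-based extraction can be sketched as follows. This is an illustrative version under assumptions, not the authors' implementation: `extract_clusters` and its `seed` parameter are hypothetical names, and plain string equality stands in for the "edit distance equals zero" test, which is equivalent for mapped strings of equal length.

```python
import random

def extract_clusters(patterns, seed=0):
    """Group identical mapped patterns by repeated passes over the data.

    patterns maps a peptide name to its mapped string. Returns the clusters
    and the number of pairwise comparisons performed.
    """
    rng = random.Random(seed)
    remaining = dict(patterns)
    clusters, comparisons = [], 0
    while remaining:
        # Pick a random, not yet assigned data point as the reference.
        reference = remaining[rng.choice(sorted(remaining))]
        members = []
        for name in sorted(remaining):
            comparisons += 1
            # Equality stands in for "edit distance == 0" on equal-length strings.
            if remaining[name] == reference:
                members.append(name)
        for name in members:
            del remaining[name]  # assigned points leave the search
        clusters.append(members)
    return clusters, comparisons

patterns = {"VYEPLK": "aba", "IHIDPE": "aaa", "LEVAK": "aba"}
clusters, comparisons = extract_clusters(patterns)
print(sorted(clusters))  # [['IHIDPE'], ['LEVAK', 'VYEPLK']]
print(comparisons <= len(clusters) * len(patterns))  # True
```

Each pass removes one whole cluster, so the number of comparisons stays within Q times N regardless of how many patterns are theoretically possible.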

The extraction of the clusters works very well in practice for
a two-state dataset. However, it can be further improved by using a
k-d tree data structure for extraction of clusters for higher
dimensional states [24] [25] [26]. The k-d tree is a multidimensional
binary search mechanism that represents a recursive subdivision of
the data space into disjoint subspaces by means of d-dimensional
hyperplanes. The root of the tree represents all the patterns,
while the children of the root represent subsets of patterns in
the subspaces. Searching for the clusters can
then be performed in $O(QN^{(1 - 1/\text{No. of states})})$. Note that as the
number of states increases, the searching time reduces
correspondingly because of the division of subspaces using the
hyperplanes.

3.3 Complexity Analysis 

The time complexity of the algorithm can be broken down into
two parts. The first part is for the mapping and the second part
for the extraction of the clusters. The mapping part has a
complexity of $O(KN)$, since there are N data elements and each is of
length K. The second part of the algorithm is the extraction
of the clusters. This part runs Q times, where Q is the number of
clusters present in the data. For each run, the dynamic programming
algorithm is executed, which runs in $O(K^2)$ time; the assumption is
that both of the data points being compared are equal in
length. This procedure runs Q times, giving a complexity
of $O(QK^2)$. Thus the total time complexity of the algorithm is
$T(\cdot) = O(QK^2) + O(KN)$.
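As a rough illustration of how the two terms compare, the constants reported for the timing experiments ($K = 4$, $Q = 30$, $N = 3500$) can be substituted into the bound. This is arithmetic on the stated expression only, with constant factors ignored; it is not a measured operation count.

```latex
\[
T(\cdot) = O(QK^2) + O(KN), \qquad
QK^2 = 30 \cdot 4^2 = 480, \qquad
KN = 4 \cdot 3500 = 14000 .
\]
```

For these values the mapping term $KN$ dominates, so with Q and K held fixed the bound grows linearly in N.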



4 Performance Evaluation

Performance evaluation was done for the quality of the
clusters that were extracted as well as the efficiency of the
technique. We tested the algorithm with data from a large-scale
quantitative phosphoproteomics experiment done as follows: Inner
medullary collecting duct (IMCD) samples were incubated in the
presence or absence of 1 nM dDAVP (vasopressin) for 0.5, 2, 5,
and 15 minutes (N=3), followed by LC-MS/MS-based
phosphoproteomic analysis. Quantification used 8-plex iTRAQ and
commercially available software. These phosphopeptides were
analyzed with our algorithm in order to identify groups that changed
in abundance with similar temporal responses after exposure to
vasopressin. The algorithm identified 16 clusters of
phosphopeptides with distinct temporal profiles. These time-course clusters
provide a starting point for modeling of the signaling network
involved. The algorithm has been implemented in Java(TM) SE
Runtime Environment (build 1.6.0); an executable can be obtained
on request from the author, and a web service will also be
available at the authors' page. The experiments were
conducted on a Dell server consisting of 2 Intel Xeon(R) Processors,
each running at 2.40 GHz, with 12000 KB cache and 64GB DRAM
memory. The operating system on the server is Linux RedHat
enterprise version with kernel 2.6.9-89.ELlargesmp.

Figure 5: Timing with increasing number of data points

For the timing experiments we used the same data that we
got from our biological experiments. In order to assess the timing
of the algorithm, we generated the data as follows. The
complexity analysis suggested that the algorithm must exhibit linear
time with increasing data points. Therefore, we wanted to
assess the timing while keeping the other variables relatively constant, i.e.
the number of clusters (Q) and the length of the time points (K).
We generated our data set for timing evaluation by replicating the
data that we got from our biological experiment. The ratio of
the number of clusters and the length of the time points remained
constant whenever we replicated our data by a positive whole
factor. We tested the timings for data points from 3500 to 80,000
elements.
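The replication scheme can be sketched in Python as below. This is a hedged illustration, not the authors' benchmark: the three series are the Figure 2 peptides rather than the full dataset, the helper names `mapping` and `cluster` are hypothetical, and `cluster` groups patterns with a dictionary instead of the repeated edit-distance passes, purely to keep the sketch short.

```python
import time

def mapping(series):
    """Map a series to 'a' (rise) or 'b' (otherwise) for each consecutive step."""
    return "".join("a" if nxt > cur else "b" for cur, nxt in zip(series, series[1:]))

def cluster(dataset):
    """Group series indices by mapped pattern (identical pattern = same cluster)."""
    groups = {}
    for index, series in enumerate(dataset):
        groups.setdefault(mapping(series), []).append(index)
    return groups

# Stand-in data: the three peptides of Figure 2.
base = [[0.2, 0.3, -0.1, 0.5], [0.1, 0.2, 0.4, 0.6], [0.5, 0.7, 0.1, 0.3]]

for factor in (1, 10, 100):
    data = base * factor  # replicate by a positive whole factor
    start = time.perf_counter()
    groups = cluster(data)
    elapsed = time.perf_counter() - start
    # N grows with the factor while the number of clusters and K stay fixed.
    print(len(data), len(groups), f"{elapsed:.6f}s")
```

Replicating the data multiplies N but leaves the set of distinct patterns unchanged, which is why Q and K stay constant across the timing runs.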
The timings for the algorithm with increasing number of
data points are shown in Fig. 5 for up to 80,000. The timings
observed have a linear trend with increasing number of data points,
as predicted by our complexity analysis. Even for up to 80,000
elements the timings observed are no more than 3 seconds.
Observe that, as the number of clusters decreases, the times observed for
the same number of data points show a decreasing trend. The
reason is that for fewer clusters, the number of passes that have to
be made to extract cluster patterns is smaller, further decreasing the
computation time. It is also useful to know the behavior of the
algorithm with increasing number of time points (K) in the data. Fig. 6
shows the timings of the algorithm with increasing number of
time points (K) up to 100 for a variable number of clusters. The
timings observed are in accordance with our complexity
analysis, which suggested $O(K^2)$ behavior, and the time remains under 500
seconds for clustering data sets that have 100 time points. Figure 7
also shows the timings with increasing number of time points
while varying the number of data points. As can be seen, the
algorithm exhibits very consistent behavior with increasing number
of time points as well as increasing number of data points. The
constants used during these experiments are K = 4, Q = 30 and
N = 3500 wherever appropriate.

Figure 6: Timing with increasing number of time points

Figure 7: Timing with increasing number of time points

The quality of the clustering was then compared with other
algorithms such as k-means [10] [12], self-organizing maps [11]
[27], and hierarchical clustering [9] [28] [29]. A brief summary
of the results from these experiments is given in Table 1. The
assessment of the quality was done as follows. The standard algorithms
were used as described in [10] [11] [9] on our dataset, which is the
same iTRAQ labeled data that has been described in the
literature. For the clusters that were identified by these algorithms, if
more than 1% of the data points were incorrect, the cluster is
labeled as an incorrect cluster. Although there are many data points
in the cluster that are similar to one another, data points that are
incorrectly clustered can have a serious impact on the
quantification of the proteins. As shown in the table, the number of
clusters that were identified varied according to the algorithm used
because of the high variability in the data. For the clusters that
were identified, the above-mentioned criterion was used to distinguish
the correct clusters from the incorrect ones. The results obtained
matched well with previous studies; e.g. in [11] the authors,
using self-organizing maps (SOM), were only able to use 9 clusters
that were sufficiently accurate to be used for quantification
studies, thus limiting the biologically meaningful data analysis. Using
our algorithm, it can be seen that the number of clusters observed
was in agreement with our theoretical analysis. None of the data
points that were included in the clusters had a variation of
more than 1%, making it a highly accurate clustering algorithm
for iTRAQ labeled protein quantification analysis.

Table 1: Number of correct clusters with different algorithms

Algorithm      Clusters Identified   Correct Clusters
K-means        16                    3
Hierarchical   18                    6
SOM            20                    10
TPM            16                    16

Figure 8: Example of clustered data points

Figure 9: Example 2 of clustered data points



Figure 10: Example 3 of clustered data points (x-axis: time in
minutes)

We were able to identify 16 distinct clusters for
phosphopeptides with distinct temporal profiles. We also performed
sub-clustering of the data purely for biological analysis, i.e. for
quantification we wanted separate clusters for the ratios that are
negative all over the time course, the ones that were positive and the
ones that crossed from positive to negative or negative to positive
over the length of the experiment. Some examples of the
clusters are shown in Figs. 8, 9 and 10, each line representing a peptide.
As shown in Fig. 8, even though the data points have high
variability in terms of distance of the points, the points have been
clustered accurately with respect to the pattern of the response.
For all of the data points that were clustered in Fig. 8, the ratios
first decreased, then increased and decreased further in the last
time point. The same can be observed for Figs. 9 and 10: even
though the points have a lot of variability in terms of distance of
the points, they are clustered correctly in terms of patterns over
time.
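The sub-clustering by sign described above can be sketched as a small Python helper. This is an illustration under assumptions, not the authors' procedure: the name `sign_category` is hypothetical, a series containing a zero is treated as crossing, and the first input series is made up for the example while the other two come from Figure 2.

```python
def sign_category(ratios):
    """Label a series as 'negative', 'positive', or 'crossing'."""
    if all(r < 0 for r in ratios):
        return "negative"
    if all(r > 0 for r in ratios):
        return "positive"
    # Mixed signs (or zeros, by assumption) count as crossing.
    return "crossing"

print(sign_category([-0.4, -0.1, -0.3, -0.2]))  # negative
print(sign_category([0.1, 0.2, 0.4, 0.6]))      # positive
print(sign_category([0.2, 0.3, -0.1, 0.5]))     # crossing
```

Applying such a label within each temporal cluster would separate series that share a rise/fall pattern but sit on different sides of zero.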



5 Conclusion 

We developed a new algorithm called TPM for clustering
time-course patterns following step-inputs in biological systems for
iTRAQ labeled phosphopeptides. We tested the algorithm with
data from large-scale quantitative phosphoproteomics
experiments for up to 80,000 data points and up to 100 time point
intervals. Quantification used 8-plex iTRAQ and commercially
available software. These phosphopeptides were analyzed with our
algorithm in order to identify groups that changed in abundance
with similar temporal responses after vasopressin addition. The
algorithm maps the data from a Cartesian plane to a discrete
binary plane and uses an efficient dynamic programming technique
to mine similar patterns after mapping. The mapping allows
clustering of similar time courses that are temporally closer to each
other. The algorithm identified 16 clusters of phosphopeptides
with distinct temporal profiles in response to vasopressin. The
algorithm was also compared for quality to other standard clustering
techniques that have been used for similar experiments in the
literature. It was shown that the proposed algorithm performed with
significantly better accuracy (at least 99% of clusters were
assigned correctly) with the ability to handle large data sets. These
time-course clusters provide a starting point for modeling of the
signaling network. We believe that the proposed algorithm will
prove useful to the computational biology and mass spectrometry
community.